Deep evolutionary fusion neural network: a new prediction standard for infectious disease incidence rates

Background Many methods have been used to predict the incidence trends of infectious diseases, with varying degrees of success. However, there is a lack of prediction benchmarks that integrate linear and nonlinear methods and effectively use internet data. The aim of this paper is to develop a prediction model for the incidence rate of infectious diseases that integrates multiple methods and multisource data.

Results The infectious disease datasets are from official releases and include four national and three regional datasets. The Baidu index platform provides the internet data. We choose single models (seasonal autoregressive integrated moving average (SARIMA), nonlinear autoregressive neural network (NAR), and long short-term memory (LSTM)) and a deep evolutionary fusion neural network (DEFNN). The DEFNN is built using the ideas of neural evolution and fusion, and the DEFNN + is built by adding multisource data. We compare model accuracy on the reference group data and validate model generalizability on external data. (1) The loss of SA-LSTM on the reference group dataset is 0.4919, which is better than those of the other models. (2) The loss values of SA-LSTM on the national and regional external datasets are 0.9666, 1.2437, 0.2472, 0.7239, 1.4026, and 0.6868. (3) When multisource indices are added to the national datasets, the loss values of the DEFNN + are 0.4212, 0.8218, 1.0331, and 0.8575.

Conclusions We propose an SA-LSTM optimization model with good accuracy and generalizability based on the concept of fusing multiple methods and multiple data sources. The DEFNN enriches and supplements infectious disease prediction methodologies, can serve as a new benchmark for future infectious disease predictions, and provides a reference for predicting the incidence rates of various infectious diseases.


Background
The incidence of infectious diseases has always been a public health problem worthy of attention and can cause very large social, economic and health burdens [1]. Using data on the numbers of infectious cases to predict the trends of infectious diseases can provide a focus and direction for the actual prevention and control of infectious diseases and can also allow for the evaluation of epidemic prevention effects and long-term outcomes [2]. However, previous research on infectious disease prediction has mainly focused on a single disease and a single region, and there have been few systematic studies covering multiple diseases at both the national and regional levels [3][4][5]. Therefore, in this paper, the effectiveness and generalizability of incidence rate prediction for infectious diseases are studied on national and regional datasets.
At present, the field of infectious disease trend prediction faces many problems, such as poor model accuracy and weak generalization performance [6]. There are many kinds of time series prediction models for infectious diseases, but there is no systematic, unified and standardized research or modelling approach. Previous studies have often used time series methods to predict the incidence rates of infectious diseases.
Autoregressive integrated moving average (ARIMA) uses the historical values of a univariate time series to predict future values and is suitable for processing stationary data with linear trends. On this basis, the seasonal ARIMA (SARIMA) model was developed. SARIMA fully considers seasonal information to effectively predict seasonal infectious diseases [7]. However, SARIMA assumes that the underlying time series is linear, so it is not suitable for analysing data with nonlinear components [8]. To solve this problem, nonlinear machine learning models represented by artificial neural networks (ANNs) have been gradually proposed and promoted. Nonlinear autoregressive neural networks (NARNNs), hereafter referred to as NARs, approach nonlinear regression through neural networks and can generalize and deal with high-dimensional nonlinear regression estimation [9][10][11][12]. In addition, long short-term memory (LSTM), an improved recurrent neural network (RNN), has brought revolutionary changes to various fields. LSTM has powerful feature extraction and representation capabilities, is successful in processing and predicting time series with long intervals and delays, and can mitigate the vanishing gradient problem of RNNs [13][14][15]. Linear statistical models and nonlinear neural network models each have advantages and disadvantages in time series modelling. Combining the two types of models to perform a fusion analysis of infectious disease time series may achieve good results [16].
Previously, data collection for infectious diseases was limited to the diagnosis stage, with issues such as incomplete coverage and poor timeliness. Moreover, the factors influencing the occurrence of infectious diseases are complex, and the traditional monitoring system is ineffective for tracking new infectious diseases. As internet technology advances, it is now possible to obtain and explore internet data [17]. The effective use of internet health data may provide high application value, potentially improving the effect of infectious disease early warning research [18]. There are currently some precedents for the use of internet data in the prediction of infectious diseases [19][20][21]. Their findings highlight the importance and relevance of internet data in the prediction of infectious diseases.
Feature selection and hyperparameter optimization are necessary steps for machine learning methods to achieve good accuracy, but traditional algorithms often have problems, such as slow convergence speeds and a tendency to fall into local optima [22]. Evolutionary neural networks (ENNs) are neural network models that combine evolutionary computing with neural networks [23]. Therefore, in this paper, the concept of evolution is employed to improve the efficiency of hyperparameter estimation, and the coronavirus herd immunity optimizer (CHIO) is used to adjust the hyperparameters [24,25]. CHIO is a metaheuristic algorithm that was proposed in 2020 and was inspired by social distancing and herd immunity strategies. When the proportion of immunized individuals gradually increases and reaches the herd immunity state, susceptible individuals are better protected. Compared with the traditional grid search method, a metaheuristic algorithm based on evolutionary thinking can effectively improve the optimization accuracy while reducing the optimization time.
We construct a new deep evolutionary fusion neural network (DEFNN), which aims to fully extract the information of time series by using various types of models. We also select six national or regional external datasets to study the generalizability of the DEFNN. Our method has the following innovations. (1) Multiple methods: We develop a new DEFNN prediction model that combines linear and nonlinear methods based on the residual method and tune it using a metaheuristic algorithm. Combining these methods with evolutionary ideas makes it possible to exploit the benefits of each type of method and to search efficiently for the optimal hyperparameters. (2) Multisource data: We include internet data in addition to infectious disease data, using them to correct the prediction results and to improve the model's prediction ability on the disturbance term. (3) Application value: We are the first to apply the DEFNN model to benchmark data and to national and regional external test sets. This model has good prediction accuracy and generalizability. The DEFNN and DEFNN + models constructed in this paper enrich and supplement the methodological research on infectious disease prediction, serving as new benchmarks for future infectious disease prediction and providing a reference for predicting the incidence of various infectious diseases.

Methods
This paper covers data preprocessing, single model construction, multiple method correction, multisource data correction, generalization performance verification, and other steps. Figure 1 depicts the technical route of this method.

Infectious disease data
Seven infectious disease datasets were selected in this paper. The reference group data were the national hepatitis C incidence rate data from the National Health Commission. The national external data included hepatitis B, tuberculosis and brucellosis data, which were obtained from the official website of the National Health Commission. The regional external data included hepatitis C-R, hepatitis B-R and varicella-R data, which were obtained from the Chongqing Health Commission. The basic information of the datasets is shown in Table 1. The data in this study were reasonably collected, and there were no missing data.
In addition, population data were sourced from China's annual statistical yearbooks, and geographic data were sourced from the National Basic Geographic Information Database.
Fig. 1 Flowchart of this research. The single models include linear (SARIMA) and nonlinear (NAR and LSTM) models. In NAR, three different training algorithms are selected for optimization, namely, the Levenberg-Marquardt (LM), scaled conjugate gradient (SCG) and Bayesian regularization (BR) algorithms, which are defined as NAR1, NAR2 and NAR3, respectively
Table 1 The basic information of the infectious disease datasets. R stands for regional dataset, the time span count (months) is in parentheses, the official website refers to the official website of the Chinese Health Commission, and the database refers to the database of the Health Commission

Internet data
The internet data are sourced from the Baidu index platform, which mainly focuses on the search frequency and click frequency of key texts. The internet retrieval keywords are shown in Table 2. The Baidu information index is based on big data analysis and search click-through volume and measures the attention to and popularity of a specific news topic in Baidu searches over a certain period. The basic information for feature extraction and selection of internet data is shown in Table 3. The independent variable contains 4 key texts. Feature extraction is performed on each index time series, and the 22 most critical features are selected for each key text [26], giving 88 features (4 × 22) in total. Variance selection is then performed with a threshold of 1.0. Finally, the numbers of features retained for the hepatitis C, hepatitis B, tuberculosis, and brucellosis datasets are 16, 18, 21, and 22, respectively. Afterwards, we use a decision tree on the selected features to fit and correct the prediction error for the current month. After extracting features from the internet data, we complete the conversion from daily to monthly values, so that both the independent and dependent variables are monthly data.
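The variance selection step with a threshold of 1.0 can be illustrated with a minimal Python sketch. The function name `select_by_variance`, the feature names and the toy values are hypothetical, and the use of the population variance is an assumption, since the text does not state which variance estimator was applied. The study itself was carried out in MATLAB, so this is an illustration of the idea rather than the authors' code.

```python
from statistics import pvariance

def select_by_variance(features, threshold=1.0):
    """Keep features whose population variance exceeds the threshold.

    features: dict mapping a (hypothetical) feature name to its monthly values.
    """
    return {name: values for name, values in features.items()
            if pvariance(values) > threshold}

# Toy monthly feature series (illustrative values only, not study data).
toy_features = {
    "mean_search_index": [10.0, 14.0, 9.0, 16.0],  # varies strongly
    "flat_feature": [1.0, 1.1, 0.9, 1.0],          # nearly constant
}
kept = select_by_variance(toy_features)
print(sorted(kept))
```

Features that barely change from month to month carry little information for correcting the prediction error, which is why a simple variance filter can reduce the 88 candidate features before the decision tree is fitted.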
The acquisition of internet data for infectious diseases involves the following steps. First, keyword cooccurrence analysis, expert consensus, and experience are used to determine the highly relevant keyword information of the infectious disease to be analysed through the database. Second, search engines are used to match text data and obtain keyword text index information. The search engine here specifically refers to the Baidu Search Index website, which is well suited to representing social hotspots and the real-time living conditions of residents. Then, time series feature extraction is performed on the monthly index information to obtain the time series characteristics of each index for each month. Finally, we extract the 22 most important time series features [26,27] and incorporate them into the model for analysis. Catch22 is a set of 22 time series features drawn from the feature library of the hctsa toolbox. It is a small time series feature set with strong predictive ability. By using the Catch22 feature set, we can effectively reduce the computational complexity and avoid feature redundancy without affecting the predictive performance of the final model.
We divided data preprocessing into two parts to improve the convenience of data processing, accelerate the model's convergence speed, and avoid the impact of different dimensions on the accuracy of the results. For linear models, the logarithmic transformation was used, and for nonlinear models, the standardization method was used. Thus, after standardization, the data have zero mean and unit variance.
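The two preprocessing branches described above can be sketched as follows. This is a minimal illustration in Python; the function names and the sample incidence values are hypothetical, and the natural logarithm is assumed because the base of the logarithm is not stated.

```python
import math
from statistics import mean, pstdev

def log_transform(series):
    """Natural-log transform for the linear model (requires positive values)."""
    return [math.log(x) for x in series]

def standardize(series):
    """Z-score standardization; returns the scaled series plus (mean, std)."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series], (mu, sigma)

def inverse_standardize(scaled, params):
    """Map standardized values back to the original scale."""
    mu, sigma = params
    return [z * sigma + mu for z in scaled]

incidence = [1.2, 1.5, 1.9, 1.4, 1.1, 1.6]  # illustrative monthly rates
logged = log_transform(incidence)            # input for the linear branch
scaled, params = standardize(incidence)      # input for the nonlinear branch
restored = inverse_standardize(scaled, params)
```

Keeping the mean and standard deviation from the training data makes it possible to map network outputs back to incidence rates before the errors are evaluated.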
Furthermore, the model construction, evaluation index, and calculations were all based on MATLAB 2021b, and the geographic information map was drawn using ArcGIS 10.8.1.

Fusion of multiple methods
As shown in Fig. 2, the seasonal-trend decomposition procedure based on loess (STL) is a representative time series decomposition algorithm that can divide a time series into trend, periodic, and disturbance terms. The traditional linear time series method is useful for revealing the change rules of the trend and periodic terms, but it does not analyse the disturbance term, which introduces errors into the prediction results. This paper proposes two correction methods to better predict the behaviour of the disturbance term, namely, "multiple methods" and "multisource data." We examined the nonlinear methods represented by NAR and LSTM, which learn nonlinear characteristics well and can effectively evaluate the fluctuation state of a time series. We therefore employed nonlinear methods to correct the outputs of the linear time series method. Our modelling process incorporates the concepts of "fusion" and "evolution," with the goal of overcoming the limitations of a single method. We used the residual sequence from the linear model as the training set for the nonlinear model, fused the prediction sequences of the linear and nonlinear models, and optimized the model hyperparameters using the coronavirus herd immunity optimizer (CHIO) [24,25], thereby building the DEFNN.
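The residual fusion idea (a linear model fitted first, a nonlinear model trained on its residuals, and the two predictions added) can be sketched in a few lines of Python. The stand-ins here are deliberate simplifications and are assumptions of this sketch: seasonal means replace SARIMA, and a nearest-neighbour rule on lagged residuals replaces NAR or LSTM. The series values are illustrative only.

```python
def seasonal_means(train, period):
    """Linear stand-in for SARIMA: the mean of each seasonal position."""
    return [sum(train[k::period]) / len(train[k::period]) for k in range(period)]

def knn_residual_forecast(residuals, lags=2, k=2):
    """Nonlinear stand-in for NAR/LSTM: average the successors of the k
    past lag-windows that are closest to the most recent window."""
    query = residuals[-lags:]
    candidates = []
    for t in range(lags, len(residuals)):
        window = residuals[t - lags:t]
        dist = sum((a - b) ** 2 for a, b in zip(window, query))
        candidates.append((dist, residuals[t]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    return sum(value for _, value in nearest) / len(nearest)

def fused_forecast(train, period):
    """One-step-ahead forecast = linear prediction + predicted residual."""
    profile = seasonal_means(train, period)
    linear_fit = [profile[t % period] for t in range(len(train))]
    residuals = [y - f for y, f in zip(train, linear_fit)]
    linear_next = profile[len(train) % period]
    residual_next = knn_residual_forecast(residuals)
    return linear_next, residual_next, linear_next + residual_next

series = [2.0, 3.0, 4.0, 3.0, 2.2, 3.1, 4.3, 3.2, 2.1, 3.3, 4.1, 3.0]
linear_next, residual_next, fused = fused_forecast(series, period=4)
```

The design point is that the nonlinear model never sees the raw series; it only models what the linear model failed to explain, so each component handles the part of the signal it is suited to.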
The model was trained using the reference group data (hepatitis C dataset). During the training process, CHIO was used to perform hyperparameter optimization, after which the best hyperparameters of the various models could be determined. Details can be found in Table 4. After training, the national and regional external datasets were used to validate the optimal model's generalization performance. Finally, we built and tested the deep evolutionary fusion neural network, which included SA-NAR1, SA-NAR2, SA-NAR3, and SA-LSTM.
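To make the role of a population-based optimizer concrete, the following Python sketch runs a generic evolutionary search over two hyperparameters. It is not an implementation of CHIO: it uses plain truncation selection and random mutation, and the objective `toy_validation_loss`, the hyperparameter ranges and the function names are all hypothetical stand-ins for training a model and scoring it on validation data.

```python
import random

def toy_validation_loss(hidden_units, learning_rate):
    """Hypothetical stand-in for training a model and scoring it on
    validation data; its minimum is at 32 units and a rate of 0.01."""
    return ((hidden_units - 32) / 32) ** 2 + (100 * (learning_rate - 0.01)) ** 2

def evolve(loss, generations=40, population_size=12, seed=0):
    """Keep the better half of the population and refill it with
    mutated copies of the survivors (a generic evolutionary loop)."""
    rng = random.Random(seed)
    population = [(rng.randint(4, 128), rng.uniform(0.001, 0.1))
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=lambda p: loss(*p))
        survivors = population[:population_size // 2]
        children = []
        for units, rate in survivors:
            # Mutate each survivor while staying inside the search bounds.
            new_units = min(128, max(4, units + rng.randint(-8, 8)))
            new_rate = min(0.1, max(0.001, rate * rng.uniform(0.7, 1.3)))
            children.append((new_units, new_rate))
        population = survivors + children
    return min(population, key=lambda p: loss(*p))

best_units, best_rate = evolve(toy_validation_loss)
```

Because the survivors are carried over unchanged, the best loss never gets worse from one generation to the next. Unlike a grid search, the candidates concentrate around promising regions instead of covering the whole grid.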

Fusion of multisource data
The traditional incidence rate report has a lag effect, which affects the time series prediction accuracy. Internet data are useful for providing proactive information and overcoming the lag effect associated with traditional data. We therefore used internet data to correct the results. Internet data for a specific infectious disease are used for the regression prediction of the residual part, the residual fusion process then produces the final prediction results, and the model performance is compared. Finally, the DEFNN + was built on top of the DEFNN. The distinction between the two is whether internet data were used. Figure 3 depicts the multisource data fusion process.
In this paper, multisource data fusion was implemented on the national datasets to study how much internet data correction improves the prediction results. Regression analysis was carried out with the internet data as the independent variable and the residual after model fitting as the dependent variable. We also carried out an ex-right operation on the internet data, i.e., dividing the Baidu index by the overall index activity of the month, to eliminate the impact of fluctuations in the number of search engine users.
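The ex-right operation and the residual regression can be sketched as follows. This is a minimal Python illustration under stated assumptions: a one-variable least-squares fit stands in for the decision tree regression used in the study, and all names and numbers are hypothetical.

```python
def ex_right(keyword_index, overall_activity):
    """Divide each monthly keyword index by that month's overall activity."""
    return [k / total for k, total in zip(keyword_index, overall_activity)]

def fit_simple_ols(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

keyword_index = [120.0, 150.0, 90.0, 200.0]         # illustrative monthly index
overall_activity = [1000.0, 1000.0, 900.0, 1250.0]  # illustrative monthly totals
residuals = [0.02, 0.08, -0.02, 0.10]               # illustrative model residuals

share = ex_right(keyword_index, overall_activity)
slope, intercept = fit_simple_ols(share, residuals)
correction = [intercept + slope * s for s in share]
```

The last keyword index value (200) is the largest in absolute terms, but part of that rise reflects higher overall search activity that month. The ex-right step removes this effect before the index is used to explain the residuals.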

Evaluation metrics
We used five common evaluation indicators: the mean square error (MSE), mean absolute error (MAE), root MSE (RMSE), mean absolute percentage error (MAPE) and R-square (R²). The first four indicators represent the model fitting error, while R² represents how well the model fits the trend. The smaller the error values are, the better the fitting performance of the model, and the greater R² is, the stronger the ability of the model to predict the trend of the actual data. In addition, to evaluate the proportion of time nodes at which the rising or falling trend of an infectious disease is predicted correctly, this paper proposes the accuracy of trend prediction (ATP). Let $n_i$ indicate whether the predicted trend is accurate during the $i$th prediction, taking the value 1 if it is and 0 otherwise, with the trend at step $i$ measured relative to the previous actual value $y_{i-1}$. Here, $y_0$ is the last value of the validation set, and for $i > 0$, $y_i$ represents the test set sequence and $\hat{y}_i$ represents the prediction results of the model on the test set.
Hence, the ATP can be defined as

$$\mathrm{ATP} = \frac{1}{N}\sum_{i=1}^{N} n_i,$$

where $N$ represents the total number of time nodes in the test set.
To comprehensively utilize the MSE, R², and ATP evaluation indicators, we constructed a joint objective function named loss J (Eq. 1). Our objective function ensures that the sought parameters minimize the error (MSE) and maximize the degree of fit (R²) between the model's predicted data and the actual data, thereby balancing the accuracy of the built model in predicting values and in predicting trends in infectious disease incidence. When optimization algorithms are used for hyperparameter optimization, the joint objective function lets the MSE, R², and ATP jointly guide the direction and step size of the next iteration, which prevents the variation of a single indicator or premature convergence from affecting the final choice of optimal hyperparameters. In addition, when constructing the final DEFNN + model, to better perform residual fusion and avoid overfitting, we also used fivefold cross validation to adjust the model parameters.
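The evaluation indicators can be computed with a short Python sketch. The standard definitions are used for MSE, MAE, RMSE, MAPE and R². For the ATP, the rule that a prediction counts as correct when the predicted and actual changes relative to the previous actual value have the same sign is an assumption of this sketch. The joint objective loss J is not reproduced because its formula is not given in the text. Function names and sample values are hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Standard error metrics plus R-square for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    mse = sse / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n * 100
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {"MSE": mse, "MAE": mae, "RMSE": math.sqrt(mse),
            "MAPE": mape, "R2": 1 - sse / ss_tot}

def atp(actual, predicted, last_validation_value):
    """Share of steps where predicted and actual direction of change agree
    (assumed rule: both changes relative to the previous actual value
    must have the same sign)."""
    previous = [last_validation_value] + list(actual[:-1])
    hits = 0
    for prev, a, p in zip(previous, actual, predicted):
        if (a - prev) * (p - prev) > 0:
            hits += 1
    return hits / len(actual)

actual = [2.0, 3.0, 2.5, 4.0]     # illustrative test values
predicted = [2.5, 1.5, 2.0, 3.5]  # illustrative predictions
metrics = regression_metrics(actual, predicted)
trend_accuracy = atp(actual, predicted, last_validation_value=1.0)
```

In this toy case the second prediction falls while the actual series rises, so three of the four directions are right. A model can therefore have a moderate error and still miss turning points, which is the behaviour that the ATP is meant to expose.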

Time and spatial descriptions
Here, the temporal and spatial epidemic characteristics of the seven datasets are described. Figure 4 shows the time trend diagram, and Fig. 5 shows the spatial distribution diagram. The overall development trend has certain seasonal characteristics.

Prediction accuracy of the DEFNN
Five single models and four fusion models were selected for training and testing on the reference group data. The single models involved in the comparison include SARIMA, NAR1, NAR2, NAR3, and LSTM, and the DEFNN models involved in the comparison include SA-LSTM, SA-NAR1, SA-NAR2, and SA-NAR3. The prediction effect of each model on the reference group data is shown in Table 5 and Figs. 6 and 7; only the test set data are shown.
For the single models, the MSE, MAE, RMSE and loss J values of SARIMA are all lower than those of the other models, while its R² value is higher than those of the other models; hence, it is the best single model. Compared with the single models, the indicators of the DEFNN are improved to varying degrees. On the test set, the MSE, MAE, RMSE and loss J values of the SA-LSTM model are all lower than those of the other models, while its R² value is higher than those of the other models. The loss J value of SA-LSTM is 0.4919. The fusion approach helps improve the overall prediction ability of the model and reduce the gap between the predicted and actual values. SA-LSTM is the optimal model for our infectious disease trend prediction task.

Prediction generalization of the DEFNN
The optimal SA-LSTM model was selected using the reference group data. To analyse the robustness and generalizability of the optimal model on various datasets, three national external validation datasets and three regional external validation datasets were selected for validation. See Table 6 and Fig. 8 for the results. The loss J values of SA-LSTM are 0.9666, 1.2437, 0.2472, 0.7239, 1.4026 and 0.6868.

Prediction accuracy of the DEFNN +
We also introduced additional internet data on the basis of SA-LSTM. Table 7 and Fig. 9 display the final prediction results. The results show that correcting the predictions with the Baidu index is effective and yields more accurate results. The loss J values of the DEFNN + for hepatitis C, hepatitis B, tuberculosis, and brucellosis on the four national datasets are 0.4212, 0.8218, 1.0331, and 0.8575, respectively.

Comparison with previous studies
We compared this method with the most advanced methods in the previous literature to better illustrate the performance of this model. The results show that the prediction performance of this method is better than that of the previous advanced methods. See Table 8 for details.

Discussion
In the analysis and prediction of the incidence rates of infectious diseases, we compare five single models and propose a fusion evolutionary idea based on the results. We build and select the optimal fusion model and use the national and regional external data to verify it. SARIMA is a typical representative of linear prediction models. NAR is a representative model of machine learning and has good performance in classification and regression. LSTM is a deep learning model suitable for nonlinear regression [34]. In our DEFNN, when there is a nonlinear part in the prediction data, it can be explained by the neural network, so the DEFNN clearly outperforms a single model. In addition, we introduce a function named loss J to comprehensively weigh the accuracy of the model in the prediction of infectious diseases and thereby make a more comprehensive evaluation of the performance of the model.
Our results show that SA-LSTM not only has good prediction accuracy but also has good robustness and generalizability in external verification. The reason for these results is that the SARIMA model cannot capture the nonlinear part of time series data, and the evaluation effect of the NAR and LSTM models on the linear part is also limited [35]. The DEFNN can not only make effective use of the seasonal prediction advantages of SARIMA but also explain the nonlinear trend in the dataset [36]. In addition, the evolution process based on a metaheuristic algorithm is conducive to improving the accuracy of the neural network model hyperparameter search [37][38][39]. The verification results on the external datasets in this paper show that SA-LSTM is suitable not only for national datasets of a variety of infectious diseases but also for regional datasets of a variety of infectious diseases. Internet retrieval data, as represented by the Google influenza index, have gradually been applied to the prediction of infectious diseases [40][41][42]. Internet search data can effectively reflect population internet behaviour and reasonably measure the population's spontaneous and proactive search behaviour prior to the onset of infectious diseases. We therefore introduce the research concept of multisource data fusion and construct the DEFNN + on the basis of the DEFNN. The DEFNN proposed in this paper is applicable to a wide range of seasonal infectious diseases, has high prediction accuracy, and can be used as a new benchmark in the field of infectious disease trend prediction. In the future, the benchmark that we have created can be used to predict the trends of more types of infectious diseases in more regions, providing a better foundation for infectious disease prevention, control, and evaluation.

Conclusion and future work
In summary, the DEFNN is constructed by using the ideas of neuroevolution and fusion. The optimal SA-LSTM model is superior to the other models in reference group data prediction and has good generalization performance on the external test sets. The DEFNN + can improve the prediction performance after incorporating internet data. The DEFNN model proposed in this paper has good prediction performance, can provide strong future predictions, and can guide the prevention and control of infectious diseases. It is suggested that this model be extended to predict a variety of seasonal infectious diseases to guide their prevention.
Although the design of this study is reasonable and strictly implemented, there is still some room for improvement. We look forward to the following future research directions: (1) Multidimensional and wide-ranging analyses will be carried out on the characteristics of spatial distributions, population distributions and pathogenic factors to better implement early warnings. (2) We need to continue to explore optimization methods for the model hyperparameters and improve the model effect. (3) The DEFNN + results indicate that the addition of the Baidu index can effectively correct the results and yield more accurate predictions, but the effect on the overall trend analysis is limited. Improving the utilization of internet data is therefore required to obtain more accurate prediction results.

Fig. 3 Fusion of multisource data (time series data + internet data)

Fig. 4 Time trends of the datasets.a: Hepatitis C; b: Hepatitis B; c: Tuberculosis; d: Brucellosis; e: Hepatitis C-R; f: Hepatitis B-R; g: Varicella-R

Fig. 5 Spatial distributions of the regional external datasets. Our base map is based on the standard map with the review number GS (2019) 3333 downloaded from the Standard Map Service website of the National Bureau of Surveying and Mapping Geographic Information. The base map has not been modified. a ~ b: Hepatitis C-R, c ~ d: Hepatitis B-R, e ~ f: Varicella-R; a, c, e: Geographical Plots, b, d, f: Kernel Density Plots

Fig. 7 Prediction effect of each fusion model on the reference group data

Fig. 8 Prediction results of the optimal model on the national and regional external validation data. a: Hepatitis B; b: Tuberculosis; c: Brucellosis; d: Hepatitis C-R; e: Hepatitis B-R; f: Varicella-R

Fig. 9 Model effect of the DEFNN + on each dataset. a: Hepatitis B; b: Tuberculosis; c: Brucellosis; d: Hepatitis C-R

Table 2 Search keywords of internet data

Table 3 The basic information of the internet data. The Baidu index and consulting index are derived from the Baidu search engine. The time series features are extracted with the Catch22 feature set, which determines the number of features

Table 4 Various model hyperparameters and their optimal values

Table 5 Prediction effect of the single and fusion models on the reference group data. SARIMA, NAR1, NAR2, NAR3 and LSTM are single models, while SA-NAR1, SA-NAR2, SA-NAR3 and SA-LSTM are DEFNN models. MSE: mean square error, MAE: mean absolute error, RMSE: root mean square error, R²: R-square, ATP: accuracy of trend prediction

Fig. 6 Prediction effect of each single model on the reference group data

Table 6 Prediction results of the DEFNN on the national and regional external validation data. Hepatitis B, tuberculosis and brucellosis data are national external data, while Hepatitis C-R, Hepatitis B-R and Varicella-R data are regional external data. MSE: mean square error, MAE: mean absolute error, RMSE: root mean square error, R²: R-square, ATP: accuracy of trend prediction

Table 7 Model effect of the DEFNN + on each dataset

Table 8 Comparison with previous studies